Robust distance correlation for variable screening
High-dimensional data are common in modern statistical applications, and
variable selection methods play an indispensable role in identifying the
critical features for scientific discoveries. Traditional best subset selection
methods are computationally intractable with a large number of features, while
regularization methods such as Lasso, SCAD and their variants perform poorly in
ultrahigh-dimensional data because of low computational efficiency and unstable
algorithms. Sure screening methods have become popular alternatives: they first
rapidly reduce the dimension using simple measures such as marginal
correlation, then apply a regularization method. A number of screening
methods for different models or problems have been developed; however, none of
them targets data with heavy tails, another important characteristic of modern
big data. In this paper, we propose a robust distance correlation ("RDC") based
sure screening method to perform screening in ultrahigh-dimensional regression
with heavy-tailed data. The proposed method shares the good properties of the
original model-free distance correlation based screening, with the additional
merit of robustly estimating the distance correlation when data are
heavy-tailed, which improves the model selection performance in screening. We
conducted extensive simulations under different scenarios of heavy tailedness
to demonstrate the advantage of our proposed procedure over existing
model-based and model-free screening procedures, with improved feature
selection and prediction performance. We also applied the method to
high-dimensional heavy-tailed RNA sequencing (RNA-seq) data of The Cancer
Genome Atlas (TCGA) pancreatic cancer cohort, where RDC was shown to outperform
the other methods in prioritizing the most essential and biologically
meaningful genes.
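The screening idea described in this abstract can be sketched as follows. This is a minimal illustration of classical (non-robust) distance correlation screening, not the RDC estimator proposed in the paper, whose details the abstract does not give. The function names `distance_correlation` and `dc_screen` and the `keep` parameter are hypothetical, chosen only for this example.

```python
import numpy as np


def _centered_distances(v):
    """Pairwise Euclidean distance matrix of v, double-centered."""
    v = np.asarray(v, dtype=float).reshape(len(v), -1)
    d = np.sqrt(((v[:, None, :] - v[None, :, :]) ** 2).sum(axis=-1))
    return (d - d.mean(axis=0, keepdims=True)
              - d.mean(axis=1, keepdims=True) + d.mean())


def distance_correlation(x, y):
    """Sample distance correlation (V-statistic form), a value in [0, 1]."""
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = (a * b).mean()                      # squared distance covariance
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    if denom == 0:
        return 0.0
    # Guard against tiny negative values caused by floating-point error.
    return float(np.sqrt(max(dcov2, 0.0) / denom))


def dc_screen(X, y, keep):
    """Rank the columns of X by distance correlation with y, keep the top ones.

    Returns the indices of the `keep` highest-scoring columns and all scores.
    """
    scores = np.array([distance_correlation(X[:, j], y)
                       for j in range(X.shape[1])])
    order = np.argsort(-scores)
    return order[:keep], scores
```

A call such as `dc_screen(X, y, keep=50)` would return the retained column indices, which could then be passed to a regularization method. With heavy-tailed data the plain estimator above is exactly what the paper argues is unreliable, so it serves only as a baseline for comparison.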
Bayesian indicator variable selection of multivariate response with heterogeneous sparsity for multi-trait fine mapping
Variable selection has played a critical role in contemporary statistics
and scientific discoveries. Numerous regularization and Bayesian variable
selection methods have been developed in the past two decades, but they mainly
target only one response. As more data are being collected nowadays, it is
common to obtain and analyze multiple correlated responses from the same study.
Running a separate regression for each response ignores their correlation, so
multivariate analysis is recommended. Existing multivariate methods select
variables related to all responses without considering the possible
heterogeneous sparsity of different responses, i.e., some features may predict
only a subset of responses but not the rest. In this paper, we develop a novel
Bayesian indicator variable selection method for the multivariate regression
model with a large number of grouped predictors, targeting multiple correlated
responses with possibly heterogeneous sparsity patterns. The method is
motivated by the multi-trait fine mapping problem in genetics, which aims to
identify the variants that are causal to multiple related traits. Our new
method performs selection at the individual level, at the group level, and
specific to each response. In addition, we propose a new concept of subset
posterior inclusion probability for inference, to prioritize predictors that
target subset(s) of responses. Extensive simulations with varying sparsity
levels, heterogeneity levels and dimensions have shown the advantage of our
method in variable selection and prediction performance compared to existing
general Bayesian multivariate variable selection methods and Bayesian fine
mapping methods. We also applied our method to a real data example in imaging
genetics and identified important causal variants for brain white matter
structural change in different regions. Comment: 29 pages, 3 figures
MEMD-ABSA: A Multi-Element Multi-Domain Dataset for Aspect-Based Sentiment Analysis
Aspect-based sentiment analysis is a long-standing research interest in the
field of opinion mining, and in recent years, researchers have gradually
shifted their focus from simple ABSA subtasks to end-to-end multi-element ABSA
tasks. However, the datasets currently used in the research are limited to
individual elements of specific tasks, usually focusing on in-domain settings,
ignoring implicit aspects and opinions, and with a small data scale. To address
these issues, we propose a large-scale Multi-Element Multi-Domain dataset
(MEMD) that covers the four elements across five domains, including nearly
20,000 review sentences and 30,000 quadruples annotated with explicit and
implicit aspects and opinions for ABSA research. Meanwhile, we evaluate
generative and non-generative baselines on multiple ABSA subtasks under the
open domain setting, and the results show that open domain ABSA as well as
mining implicit aspects and opinions remain ongoing challenges to be addressed.
The datasets are publicly released at https://github.com/NUSTM/MEMD-ABSA.
Evaluation of Changes in the Characteristic Flavor of Ultra-high Temperature Sterilized Milk under the Effects of Temperature and Light
To study changes in the characteristic flavor of ultra-high temperature sterilized (UHT) milk under the influence of storage temperature and light, headspace solid phase microextraction (SPME) combined with gas chromatography-mass spectrometry (GC-MS) was used to detect the volatile flavor components of the product. Descriptive sensory evaluation, orthogonal partial least squares-discriminant analysis (OPLS-DA) and the entropy weight method were used to determine the relationship between major characteristic flavors and characteristic substances. The effects of temperature and light flux on the flavor changes of different formulations of UHT milk were analyzed, and a model for comprehensive analysis of the characteristic flavors of UHT milk was developed based on the effects of initial unsaturated fatty acid content, temperature and light flux. The results of this research provide support for the quality control of different formulations of UHT milk.
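The abstract does not say how the entropy weight method was applied to the flavor data. As a rough illustration, the textbook form of the method can be sketched as below, assuming a matrix of non-negative indicator values with samples in rows and indicators in columns. The function name `entropy_weights` is hypothetical.

```python
import numpy as np


def entropy_weights(X):
    """Textbook entropy weight method.

    X: (n_samples, n_indicators) array of non-negative values, with each
    column summing to a positive number and at least one non-constant column.
    Returns one weight per indicator; the weights sum to 1.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    P = X / X.sum(axis=0)                 # share of each sample per indicator
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(P > 0, P * np.log(P), 0.0)   # treat 0*log(0) as 0
    entropy = -plogp.sum(axis=0) / np.log(n)          # scaled to [0, 1]
    divergence = 1.0 - entropy            # low entropy means more information
    return divergence / divergence.sum()
```

An indicator whose values are identical across samples has entropy 1 and receives weight 0, which matches the intuition that it cannot discriminate between samples.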
Psychometric assessment of HIV/STI sexual risk scale among MSM: A Rasch model approach
Background: Little research has assessed the degree of severity and ordering of different types of sexual behaviors for HIV/STI infection in a measurement scale. The purpose of this study was to apply the Rasch model to the psychometric assessment of an HIV/STI sexual risk scale among men who have sex with men (MSM). Methods: A cross-sectional study using respondent-driven sampling was conducted among 351 MSM in Shenzhen, China. The Rasch model was used to examine the psychometric properties of an HIV/STI sexual risk scale including nine types of sexual behaviors. Results: The Rasch analysis of the nine items met the unidimensionality and local independence assumptions. Although the person reliability was low at 0.35, the item reliability was high at 0.99. The fit statistics provided acceptable infit and outfit values. Item difficulty invariance analysis showed that the item estimates of the risk behavior items were invariant (within error). Conclusions: The findings suggest that the Rasch model can be utilized for measuring the level of sexual risk for HIV/STI infection as a single latent construct and for establishing the relative degree of severity of each type of sexual behavior in HIV/STI transmission and acquisition among MSM. The measurement scale provides a useful measurement tool to inform, design and evaluate behavioral interventions for HIV/STI infection among MSM.
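For readers unfamiliar with the Rasch model, its dichotomous form gives the probability that a person endorses an item as a logistic function of the difference between the person's latent level and the item's difficulty (here, severity). The sketch below assumes dichotomous items, which the abstract does not state, and the function name `rasch_probability` is hypothetical.

```python
import math


def rasch_probability(theta, b):
    """Dichotomous Rasch model: P(response = 1 | theta, b).

    theta: person parameter (latent risk level), in logits
    b:     item parameter (difficulty or severity), in logits
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))
```

When `theta` equals `b` the probability is exactly 0.5, and items with larger `b` are endorsed less often at any given `theta`, which is what allows the items to be ordered by severity on a single scale.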
Assessing trend and variation of Arctic sea-ice extent during 1979-2012 from a latitude perspective of ice edge
Arctic sea-ice extent (in summer) has been shrinking since the 1970s. However, we have little knowledge of the detailed spatial variability of this shrinking. In this study, we examine the (latitudinal) ice extent along each degree of longitude, using the monthly Arctic ice index data sets (1979–2012) from the National Snow and Ice Data Center. Statistical analysis suggests that: (1) for summer months (July–October), there was a 34-year declining trend in sea-ice extent at most regions, except for the Canadian Arctic Archipelago, Greenland and Svalbard, with retreat rates of 0.0562–0.0898 latitude degree/year (or 6.26–10.00 km/year, at a significance level of 0.05); (2) for sea ice not geographically muted by the continental coastline in winter months (January–April), there was a declining trend of 0.0216–0.0559 latitude degree/year (2.40–6.22 km/year, at a significance level of 0.05). Regionally, the most evident sea-ice decline occurred in the Chukchi Sea from August to October, Baffin Bay and Greenland Sea from January to May, Barents Sea in most months, Kara Sea from July to August and Laptev Sea and eastern Siberian Sea in August and September. Trend analysis also indicates that: (1) the decline in summer ice extent became significant (at a 0.05 significance level) since 1999 and (2) winter ice extent showed a clear changing point (decline) around 2000, becoming statistically significant around 2005. The Pacific–Siberian sector of the Arctic accounted for most of the summer sea-ice decline, while the winter recovery of sea ice in the Atlantic sector tended to decrease. Keywords: NSIDC ice index; Arctic; sea-ice extent; ice-edge latitude. (Published: 11 September 2014) Citation: Polar Research 2014, 33, 21249, http://dx.doi.org/10.3402/polar.v33.2124
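The retreat rates in this abstract are quoted both in degrees of latitude per year and in km per year. The conversion factor is not stated; the short check below assumes roughly 111.32 km per degree of latitude, a value chosen here only because it reproduces the quoted figures. The constant and function names are hypothetical.

```python
# Assumed conversion factor (not given in the abstract).
KM_PER_DEGREE_LATITUDE = 111.32


def degrees_per_year_to_km_per_year(rate_deg):
    """Convert an ice-edge retreat rate from latitude degrees/yr to km/yr."""
    return rate_deg * KM_PER_DEGREE_LATITUDE


for rate in (0.0562, 0.0898, 0.0216, 0.0559):
    km = degrees_per_year_to_km_per_year(rate)
    print(f"{rate} degree/year is about {km:.2f} km/year")
```

Under that assumption the four rates come out at about 6.26, 10.00, 2.40 and 6.22 km/year, consistent with the ranges reported for summer and winter.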